Sepsis is a leading cause of global and hospital mortality, and no gold standard diagnostic test is available to date. Initial haemodynamic stabilisation and the timely administration of antibiotics improve a patient's outcome [1: Evans L, Rhodes A, Alhazzani W, et al. Surviving sepsis campaign: international guidelines for management of sepsis and septic shock 2021. Intensive Care Med 2021; 47: 1181-1247], but the need for tools supporting the time-sensitive clinical diagnosis of sepsis persists. The application of machine learning to electronic medical records (EMRs) as model training data is gaining popularity for the development of individual-level surveillance algorithms for sepsis in intensive care units, hospital wards, and emergency departments [2: Kheterpal S, Singh K, Topol EJ. Digitising the prediction and management of sepsis. Lancet 2022; 399: 1459; 3: Kollef MH, Shorr AF, Bassetti M, et al. Timing of antibiotic therapy in the ICU. Crit Care 2021; 25: 360]. The time of sepsis onset is often identified and labelled retrospectively as the target outcome in a patient's EMR by use of the clinical criteria of Sepsis-1, 2, or 3. However, machine learning-based systems have so far accelerated the bedside diagnosis of sepsis only minimally compared with traditional early-warning systems and the standard of care [3; 4: Vistisen ST, Pollard TJ, Harris S, Lauritsen SM. Artificial intelligence in the clinical setting: towards actual implementation of reliable outcome predictions. Eur J Anaesthesiol 2022; 39: 729-732]. Therefore, the question arises as to whether the objective clinical criteria of Sepsis-1, 2, or 3 capture the target condition as faithfully as high-quality clinical ground truth used as a reference standard, which is common practice in the analysis of medical images and text by machine learning [5: Chen P-HC, Mermel CH, Liu Y. Evaluation of artificial intelligence on a reference standard based on subjective interpretation. Lancet Digit Health 2021; 3: e693-e695].

The Sepsis-1 and Sepsis-3 consensus definitions were each originally developed for a specific purpose. In the early 1990s, multicentre interventional trials targeting detrimental, excessive inflammation associated with sepsis required a broadly applicable case definition [6: Balk RA. Systemic inflammatory response syndrome (SIRS): where did it come from and is it still relevant today? Virulence 2014; 5: 20-26]. As a result, systemic inflammatory response syndrome (SIRS) was introduced. SIRS in the presence of proven or suspected infection was, by expert consensus, defined as sepsis (Sepsis-1). In 2016, a critical care expert task force recommended that sepsis be defined as life-threatening organ dysfunction caused by a dysregulated host response to infection (Sepsis-3) [7: Singer M, Deutschman CS, Seymour CW, et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA 2016; 315: 801-810]. This definition of Sepsis-3 was substantiated by perceived advances in pathobiology.
In a companion investigation, patients were classified by Sepsis-3 using clinical criteria that determine sepsis onset in EMRs: microbiological testing and antimicrobial therapy, reflecting suspicion of infection, together with concurrent organ dysfunction [8: Seymour CW, Liu VX, Iwashyna TJ, et al. Assessment of clinical criteria for sepsis: for the third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA 2016; 315: 762-774]. These clinical criteria have since been applied primarily as a measure of sepsis incidence in epidemiological studies, for which knowing the exact time of the clinical sepsis diagnosis is not essential. The clinical criteria of Sepsis-3 were not designed to accurately reflect the time of sepsis onset; however, these criteria, and those for Sepsis-1 and Sepsis-2, have been widely used to derive the putative time of sepsis onset in EMRs, with the aim of developing early-warning systems for sepsis through supervised machine learning, where timeliness is of great importance [2-4]. Importantly, the clinical criteria of Sepsis-1, 2, and 3 were not devised for the early diagnosis of sepsis, and have not been validated in a clinical setting under the current standard of sepsis care [1]. It is unknown whether a warning system that alerts, with potentially excellent accuracy, to a target outcome retrospectively defined by these clinical criteria in the EMR can also prospectively improve patient outcome by accelerating sepsis-specific therapy compared with current practice.

To establish an alternative diagnostic reference for sepsis, senior intensivists in our interdisciplinary surgical intensive care unit (University Medical Centre Mannheim, Mannheim, Germany) assigned daily ground truth labels for sepsis and sepsis-related conditions as working diagnoses to each patient as part of a questionnaire survey. Notwithstanding differences in individual experience and assessment, even among highly skilled experts, we reported excellent interrater agreement for sepsis and consistency of ground truth labels, with current knowledge of sepsis pathophysiology supporting their clinical pertinence [9: Lindner HA, Schamoni S, Kirschning T, et al. Ground truth labels challenge the validity of sepsis consensus definitions in critical illness. J Transl Med 2022; 20: 27]. The temporal agreement between sepsis onset as defined by the clinical criteria for Sepsis-1, 2, and 3 and by ground truth as a reference standard was only fair (Krippendorff's α=0·4). This limited agreement was due to delayed or absent detection of a suspicion of infection by clinical criteria, whereas SIRS and organ dysfunction were both detected in virtually all patients when they were first assigned a ground truth label for sepsis. A machine learning model can only predict what it has learned to predict. Model training with clinical criteria that do not capture the bedside recognition of sepsis in a timely manner is unlikely to accelerate a diagnosis when compared with current practice.
Ground truth, however, narrows down the short time window in which clinicians are tasked with ruling out a sepsis diagnosis and making far-reaching therapeutic decisions. Documented ground truth thereby approximates the point in time at which the patient first raises clinical concern, which serves the goal of improving the timeliness of sepsis recognition, and is thus a clinically decisive reference. Objective clinical criteria, by contrast, merely constitute proxies for the detection of sepsis onset. The consistency and timeliness of their entries in a patient's EMR, reflecting suspicion of infection and SIRS or organ dysfunction, are technical premises that should first be validated in a given data source. However, clinical criteria and the corresponding rules for their derivation from the EMR cannot readily replicate the time-sensitive diagnosis of sepsis, or its rejection, that results from the clinical-thinking process. In particular, the assessment of whether an infection or another cause drives clinical worsening provides the rationale for the clinician's decision on the precise diagnosis and therapy [9].

In summary, machine learning prediction models hold the promise of leveraging implicit, as yet unrecognised relationships in clinical data to expedite the diagnosis and improve the therapy of sepsis. These models should be trained with outcome labels that reflect the actual clinical judgement arising from a clinician's experience and assessment. Whether the clinical criteria of Sepsis-1, 2, and 3 fulfil this requirement is largely unknown, and they should not be used indiscriminately to this end. We argue that the consistency of prospectively collected, high-quality ground truth for sepsis with current clinical practice outweighs residual uncertainty in clinician judgement.
Ground truth remains key to realising the potential of machine learning to provide actionable predictions and new insight into sepsis pathobiology. This potential extends to the use of ground truth to detect sepsis in subclasses of critical illness trajectories identified by unsupervised machine learning, which might inform predictive enrichment for clinical studies of novel sepsis markers and therapies [10: Atreya MR, Sanchez-Pinto LN, Kamaleswaran R. Commentary: ‘critical illness subclasses: all roads lead to Rome’. Crit Care 2022; 26: 387]. Machine learning prediction models should also evolve on the basis of current ground truth for training until a gold standard test that diagnoses sepsis, and thus defines the prediction outcome, is available.

HAL and VS-L conceptualised the Comment. HAL drafted the manuscript. All authors contributed to the editing process. This work was supported by a grant from the Klaus Tschira Foundation, Germany, to all authors (project number 00.0277.2015). The funder had no role in the decision to publish or in the preparation of the manuscript.

We declare no competing interests.